memoize stringly keys of objects #373

Conversation
When unpacking msgpack objects, store the keys that appear in a memo dictionary to make them unique. This is useful because, for most sizable msgpack files, the same keys appear again and again, since many objects have the same "shape" (set of keys). A similar optimization is done in most json deserializers, e.g. in CPython: https://github.com/python/cpython/blob/d89cea15ad37e873003fc74ec2c77660ab620b00/Modules/_json.c#L717

My totally unscientific results: I tried this on two big msgpack files, a wikidata dump (92 MiB) and a dump of reddit comments (596 MiB). I am reporting time spent deserializing and memory use of the resulting data structure. I've included json deserialization numbers as a comparison. The results I get on my old-ish laptop are:

wikidata

| | time | memory |
| --- | --- | --- |
| CPython 3.7.5 before | 3.42s | 1279 MiB |
| CPython 3.7.5 after | 3.43s | 883 MiB |
| PyPy3 7.2 before | 6.44s | 1380 MiB |
| PyPy3 7.2 after | 4.98s | 965 MiB |
| CPython 3.7.5 json | 4.13s | 887 MiB |
| PyPy3 7.2 json | 3.54s | 958 MiB |

reddit

| | time | memory |
| --- | --- | --- |
| CPython 3.7.5 before | 5.62s | 3412 MiB |
| CPython 3.7.5 after | 5.20s | 1754 MiB |
| PyPy3 7.2 before | 14.72s | 3782 MiB |
| PyPy3 7.2 after | 8.37s | 2086 MiB |
| CPython 3.7.5 json | 8.64s | 1753 MiB |
| PyPy3 7.2 json | 10.52s | 2052 MiB |

For wikidata, there is only a memory improvement on CPython; the time stays the same. For the other three variants (both interpreters on reddit, PyPy on wikidata), both time and memory improve significantly. The memory improvement comes from the memoizing itself; the time improves due to better cache locality from the smaller working set, and, in the case of PyPy, less time spent in GC.
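To make the technique concrete, here is a minimal sketch of the idea in pure Python; the function and parameter names are hypothetical and do not match the actual msgpack-python internals:

```python
# Sketch of key memoization while unpacking a map (illustrative names,
# not the real msgpack-python code).
def unpack_map(read_key, read_value, length, memo):
    out = {}
    for _ in range(length):
        key = read_key()
        if isinstance(key, (str, bytes)):
            # Reuse the first object seen for each distinct key, so all
            # maps with the same "shape" share one set of key objects.
            key = memo.setdefault(key, key)
        out[key] = read_value()
    return out
```

Since `memo.setdefault(key, key)` returns the previously stored object for an equal key, every later duplicate can be discarded right after being decoded, which is where the memory savings come from.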
```c
    PyErr_Format(PyExc_ValueError, "%.100s is not allowed for map key", Py_TYPE(k)->tp_name);
    return -1;
}
if (PyUnicode_CheckExact(k) || PyBytes_CheckExact(k)) {
```
Should not make a big difference, but this can be combined with the previous condition in l.194 :)
Thanks for the review! I fixed this.
```cython
ctx.user.encoding = encoding
ctx.user.unicode_errors = unicode_errors
Py_INCREF(d)
ctx.user.memo = <PyObject*>d
```
AFAIK `<object>d` will enable refcounting in Cython. Otherwise `object` and `PyObject*` are identical.
However, I didn't manage to follow this suggestion, because the `memo` field is in a C struct, which doesn't support `object` fields.

I prefer interning because it speeds up string comparison too.
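To illustrate the comparison speedup: interning maps equal strings to a single shared object, so equality checks (for example during dict key lookups) can succeed on the identity fast path without comparing characters:

```python
import sys

a = sys.intern("content_id")
b = sys.intern("".join(["content", "_", "id"]))  # equal, but built separately
assert a is b  # one shared object, so `a == b` short-circuits on identity
```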

Isn't interning a bit dangerous, i.e., those strings are there forever until the interpreter exits?

`PyUnicode_InternImmortal()` creates an immortal string. It is a bit dangerous, as you said.

BTW, no need to care about Python 2 and bytes objects.

Interesting:

So it's been like that for a very long time and that story has just been retold again and again since.

OK, happy to switch to interning in the C version. I would still stick with an explicit dict in the fallback version, if that's OK?

Ideally the fallback would behave identically to the native code. Since you're a PyPy developer, I suppose it doesn't play nicely there? What's wrong about

OK, I investigated; it seems PyPy is also good about intern not leaking anything, I wasn't sure :-).
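For reference, here is a sketch contrasting the two strategies discussed above; the helper names are hypothetical, and this is not the actual patch:

```python
import sys

def memoize_key_intern(key):
    # Interning approach: equal str keys collapse to one shared object in
    # the interpreter-wide intern table. As established above, both CPython
    # and PyPy drop interned strings once they are otherwise unreferenced
    # (unlike PyUnicode_InternImmortal()).
    return sys.intern(key) if isinstance(key, str) else key

def memoize_key_dict(key, memo):
    # Explicit-dict approach (the fallback version): the memo is scoped to
    # a single unpacking run and is freed together with it.
    if isinstance(key, (str, bytes)):
        return memo.setdefault(key, key)
    return key
```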